Extractive QA


What Is Extractive Question Answering?

Chatbots and generative models are widely adopted in many applications today, but the information they provide is not always reliable. (You can read more about how we use them in our post Enhancing Knowledge Discovery: Implementing Retrieval Augmented Generation with Ontotext Technologies.) Usually, they operate in a “closed book” setting, which means that such systems try to provide an answer based on their general knowledge about the world.

However, in sensitive domains and applications such answers are not reliable enough. We would like to be able to extract the information directly from the source, as if highlighting relevant passages or phrases in a book. This approach is called Extractive Question Answering (QA), and although it is not discussed as often nowadays as the generative approach to the problem, it can provide valuable insights and help structure data.

 

We would not yet trust large language models (LLMs) to answer such sensitive questions, but finding the exact response in a trustworthy source is quite helpful in this case.

 

In contrast to Generative QA, which tries to formulate the answer in its own words, Extractive QA is like highlighting the answer in the given text. From a practical perspective, applying generative models to such tasks is computationally more expensive and less precise due to the non-deterministic nature of their output.

 

The task of Extractive QA requires a deeper understanding of the context to identify the most relevant passage that directly addresses the user’s question. The scope of the questions is almost unlimited as long as the answer is present in the text; if it isn’t, the model returns an empty output. The resulting pipeline can easily be applied to questions that were not seen during training. With a single model we can extract any kind of information within the same domain, and it is usually robust to how the questions are formulated, allowing users a smoother interaction.
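
For example, the "question-answering" pipeline in the Hugging Face transformers library (introduced below) performs exactly this kind of span extraction. A minimal sketch follows; the checkpoint deepset/roberta-base-squad2 and the example context are illustrative assumptions, not taken from the original text:

from transformers import pipeline

# Extractive QA: the model returns a span of the given context rather than generated text
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")  # assumed checkpoint

context = (
    "Extractive Question Answering locates the answer as a span inside "
    "a trusted source document instead of generating it from scratch."
)

result = qa(
    question="Where does Extractive QA locate the answer?",
    context=context,
    handle_impossible_answer=True,  # allows an empty answer when the context contains none
)

# The result contains the answer text plus its character offsets in the context,
# which is what makes the "highlighting" behaviour possible.
print(result["answer"], result["start"], result["end"], result["score"])

The start and end offsets let an application highlight the extracted span directly in the source text, which is the main practical difference from a generative model’s free-form output.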

 

From the Hugging Face transformers.pipeline documentation:

task (str) — The task defining which pipeline will be returned. Currently accepted tasks are:

Of these, the tasks relevant to extractive QA are:

"document-question-answering": will return a DocumentQuestionAnsweringPipeline.

"question-answering": will return a QuestionAnsweringPipeline.

"table-question-answering": will return a TableQuestionAnsweringPipeline.

from transformers import pipeline

# Table QA: a TAPAS model fine-tuned on WikiTableQuestions answers questions over a table
# given as a dict mapping column names to lists of cell values
oracle = pipeline(model="google/tapas-base-finetuned-wtq")

table = {
    "Repository": ["Transformers", "Datasets", "Tokenizers"],
    "Stars": ["36542", "4512", "3934"],
    "Contributors": ["651", "77", "34"],
    "Programming language": ["Python", "Python", "Rust, Python and NodeJS"],
}

oracle(query="How many stars does the transformers repository have?", table=table)
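
The document-question-answering pipeline from the list above extends the same extractive idea to scanned documents and other document images. A minimal sketch, assuming pytesseract and Pillow are installed for OCR, that the impira/layoutlm-document-qa checkpoint is used, and that "invoice.png" is a placeholder for a local document image:

from transformers import pipeline

# Document QA: extracts the answer from an image of a document (text located via OCR)
doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")  # assumed checkpoint

# "invoice.png" is a placeholder path; any scanned document image can be passed here
doc_qa(image="invoice.png", question="What is the invoice number?")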

 

 
